1 Introduction

This workflow reanalyze the adult single nucleus RNA-seq data produced by (Wang et al. 2018), including adult using single-nucleus droplet-based sequencing (snDrop-seq) from PFC (Lake et al. 2018).

In Page 7 of the supplementary materials of (Wang et al. 2018):

For the UMI count-based dataset, we integrated the PsychENCODE adult single-cell profiles for 17,093 cells from dorsolateral prefrontal cortex (DFC) with the published 10,319 adult single-cell data from PFC (20). The integrated UMI dataset includes nine excitatory, ten inhibitory and six non-neuronal cell types (astrocytes, endothelial cells, microglia, oligodendrocytes, OPCs, and pericytes), and a newly discovered excitatory neuronal type (Ex9).

1.1 Libraries and directories

We will first load a few libraries. Among them, * DropletUtils provides functions for data from droplet technologies such as 10X Genomics. * biomaRt provides easy access to databases, such as Ensembl, COSMIC, Uniprot, HGNC, etc. * scater is a collection of tools for doing quality control analyses of scRNA-seq * scran provide functions for normalization of cell-specific libary sizes, correcting batch effects, and identification marker genes

bio_packs = c("SingleCellExperiment","DropletUtils","biomaRt",
              "scater","scran","limma","org.Hs.eg.db")
source("https://bioconductor.org/biocLite.R")
for(pack1 in bio_packs){
  if( !pack1 %in% installed.packages()[,"Package"]){
    biocLite(pack1, suppressUpdates = TRUE)
  }
}

cran_packs = c("data.table","svd","Rtsne","stringi","irlba")
for(pack1 in cran_packs){
  if( !pack1 %in% installed.packages()[,"Package"]){
    install.packages(pack1)
  }
}

library(data.table)
library(SingleCellExperiment)
library(DropletUtils)
library(biomaRt)
library(scater)
library(scran)
library(limma)
library(ggplot2)
repo_dir  = "~/scRNAseq_pipelines"
work_dir  = file.path(repo_dir,"psychENCODE")
psychENCODE_dir = "~/psychENCODE_data"

2 Obtaining/Loading Counts

Next we import in the count data and other available information. The dataset is available here.

local_counts_fn  = file.path(psychENCODE_dir, "DER-22_Single_cell_expression_raw_UMI.tsv")
online_counts_fn = "http://resource.psychencode.org/Datasets/Derived/SC_Decomp/DER-22_Single_cell_expression_raw_UMI.tsv"
if (file.exists(local_counts_fn)) {
  counts = fread(local_counts_fn, data.table = FALSE)
} else {
  counts = fread(online_counts_fn, data.table = FALSE)
}
## Warning in fread(local_counts_fn, data.table = FALSE): Detected 27412
## column names but the data has 27413 columns (i.e. invalid file). Added 1
## extra default column name for the first column which is guessed to be row
## names or an index. Use setnames() afterwards if this guess is not correct,
## or fix the file write command that created the file to create a valid file.
dim(counts); counts[1:3,1:2]
## [1] 17176 27413
##         V1 Ex3e
## 1     A1BG    0
## 2 A1BG-AS1    0
## 3     A1CF    0
rownames(counts) = counts$V1
counts = as.matrix(counts[,-1])
colnames(counts)[1:10]
##  [1] "Ex3e"    "Ex2"     "In1b"    "Oligo"   "Ex1"     "Astro"   "Ex8"    
##  [8] "Astro.1" "Astro.2" "In4b"
# Parse cell ID. Since we used data.table() to read the table as data.table, it has automatically parsed the cell types.
# We need to parse it back as:
# Ex3e.2 -> Ex3e, V2898 -> NA
part_cell_id = sapply(colnames(counts), 
                      function(xx) strsplit(xx,".",fixed=TRUE)[[1]][1],
                      USE.NAMES=FALSE)
part_cell_id[grep("^V", part_cell_id)] = NA
table(part_cell_id, useNA = "ifany")
## part_cell_id
##     Astro      Endo       Ex1       Ex2      Ex3e       Ex4      Ex5b 
##      2815       549      5059      1428       660      1571      2665 
##      Ex6a      Ex6b       Ex8       Ex9      In1a      In1b      In1c 
##       319      1426       320       261       185       348       667 
##       In3      In4a      In4b      In6a      In6b       In7       In8 
##       629       249       614       272      1432       305      1344 
## Microglia     Oligo       OPC       Per      <NA> 
##       317      2657      1243        45        32
col_dat = data.frame(sample_name = colnames(counts), 
                     part_cell_id = part_cell_id,
                     stringsAsFactors = FALSE)
col_dat[1:5,]
##   sample_name part_cell_id
## 1        Ex3e         Ex3e
## 2         Ex2          Ex2
## 3        In1b         In1b
## 4       Oligo        Oligo
## 5         Ex1          Ex1

We create sce object. We also check for spike-ins.

sce = SingleCellExperiment(assays = list(counts = counts),colData = col_dat)
rm(counts)  # save some memory
sce
## class: SingleCellExperiment 
## dim: 17176 27412 
## metadata(0):
## assays(1): counts
## rownames(17176): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(27412): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(2): sample_name part_cell_id
## reducedDimNames(0):
## spikeNames(0):
rownames(sce)[grep("^ERCC",rownames(sce))]
## [1] "ERCC1"   "ERCC2"   "ERCC3"   "ERCC4"   "ERCC6"   "ERCC6L"  "ERCC6L2"
## [8] "ERCC8"

3 Pre-processing and Quality Control

3.1 Gene Annotation

We will extract annotation information based on gene names.

anno_file = file.path(work_dir,"gene.annotation.rds")
if( file.exists(anno_file) ){
  gene_anno = readRDS(anno_file)
} else{
  ensembl = useEnsembl(biomart="ensembl",
                       dataset="hsapiens_gene_ensembl")
  
  attr_string = c("hgnc_symbol","ensembl_gene_id","external_gene_name",
                  "chromosome_name", "start_position","end_position","strand",
                  "description","percentage_gene_gc_content","gene_biotype")
  
  gene_anno = getBM(attributes = attr_string,
                    filters = "external_gene_name",
                    values = rownames(sce),
                    mart = ensembl)
  saveRDS(gene_anno, file=anno_file)
}

dim(sce); dim(gene_anno)
## [1] 17176 27412
## [1] 18551    10

We remove those genes that are part of extracted annotation, but aren’t in sce.

w2rm = which(!gene_anno$external_gene_name %in% rownames(sce))
w2rm
## integer(0)
gene_anno[w2rm,]
##  [1] hgnc_symbol                ensembl_gene_id           
##  [3] external_gene_name         chromosome_name           
##  [5] start_position             end_position              
##  [7] strand                     description               
##  [9] percentage_gene_gc_content gene_biotype              
## <0 rows> (or 0-length row.names)
if (length(w2rm) != 0) gene_anno = gene_anno[-w2rm,]
dim(sce); dim(gene_anno)
## [1] 17176 27412
## [1] 18551    10

Many genes have multiple entries in the annotation, often because they are annotated to scaffolds, assembly patches and alternate loci. Here we simply remove such entries. The we remove duplicated annotations and genes without annotations.

table(gene_anno$chromosome_name)[1:30]
## 
##                1               10               11               12 
##             1651              697              907              853 
##               13               14               15               16 
##              375              500              551              699 
##               17               18               19                2 
##              938              257             1114             1222 
##               20               21               22                3 
##              469              187              400             1041 
##                4                5                6                7 
##              668              779              856              845 
##                8                9    CHR_HG1_PATCH  CHR_HG107_PATCH 
##              635              665               20                1 
##  CHR_HG126_PATCH CHR_HG1311_PATCH CHR_HG1362_PATCH CHR_HG1815_PATCH 
##                1                2                6                8 
## CHR_HG1832_PATCH CHR_HG2002_PATCH 
##                5                6
chr_nms = c(1:22,"X","Y","MT")
gene_anno = gene_anno[which(gene_anno$chromosome_name %in% chr_nms),]
dim(sce); dim(gene_anno)
## [1] 17176 27412
## [1] 16971    10
t1 = table(gene_anno$external_gene_name)
t2 = sort(t1[t1 > 1], decreasing=TRUE)
length(t2)
## [1] 15
t2[1:10]
## 
##       ABCF2       ATXN7      CCDC39      DIABLO  DNAJC9-AS1     GOLGA8M 
##           2           2           2           2           2           2 
##      HSPA14 KBTBD11-OT1   LINC01115   LINC01505 
##           2           2           2           2
gene_anno[which(gene_anno$external_gene_name %in% names(t2)[1:4]), 1:4]
##      hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
## 124        ABCF2 ENSG00000033050              ABCF2               7
## 125        ABCF2 ENSG00000285292              ABCF2               7
## 1792       ATXN7 ENSG00000163635              ATXN7               3
## 1793       ATXN7 ENSG00000285258              ATXN7               3
## 2380      CCDC39 ENSG00000145075             CCDC39               3
## 2381             ENSG00000284862             CCDC39               3
## 4041      DIABLO ENSG00000184047             DIABLO              12
## 4042      DIABLO ENSG00000284934             DIABLO              12
w_duplicate = which(gene_anno$external_gene_name %in% names(t2))
ganno2 = gene_anno[w_duplicate,]
dim(ganno2)
## [1] 30 10
table(ganno2$hgnc_symbol == ganno2$external_gene_name)
## 
## FALSE  TRUE 
##    10    20
ganno2 = ganno2[which(ganno2$hgnc_symbol == ganno2$external_gene_name),]
dim(ganno2)
## [1] 20 10
ganno2 = dplyr::distinct(ganno2,external_gene_name,.keep_all = TRUE)
dim(ganno2)
## [1] 15 10
gene_anno = rbind(gene_anno[-w_duplicate,], ganno2)
dim(gene_anno)
## [1] 16956    10
table(gene_anno$gene_biotype)
## 
##           3prime_overlapping_ncRNA                          antisense 
##                                  4                                708 
##      bidirectional_promoter_lncRNA                            lincRNA 
##                                 12                                641 
##                       macro_lncRNA                              miRNA 
##                                  1                                  1 
##                           misc_RNA             polymorphic_pseudogene 
##                                  5                                  7 
##               processed_pseudogene               processed_transcript 
##                                327                                131 
##                     protein_coding                              scRNA 
##                              14600                                  1 
##                  sense_overlapping                                TEC 
##                                 11                                  4 
##                          TR_C_gene   transcribed_processed_pseudogene 
##                                  3                                 79 
##     transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene 
##                                 56                                245 
##                 unitary_pseudogene             unprocessed_pseudogene 
##                                  7                                113
gene_missing = setdiff(rownames(sce),gene_anno$external_gene_name)
gene_missing[1:10]
##  [1] "AAED1"      "AATK-AS1"   "AC004448.5" "AC004924.1" "AC005154.6"
##  [6] "AC005550.3" "AC007365.3" "AC007386.2" "AC007966.1" "AC008074.3"
length(gene_missing)
## [1] 220
w2kp = match(gene_anno$external_gene_name,rownames(sce))
table(is.na(w2kp))
## 
## FALSE 
## 16956
gene_anno$external_gene_name[which(is.na(w2kp))]
## character(0)
sce = sce[w2kp,]

dim(sce)
## [1] 16956 27412
rowData(sce) = gene_anno

sce
## class: SingleCellExperiment 
## dim: 16956 27412 
## metadata(0):
## assays(1): counts
## rownames(16956): ABCA7 AC009134.1 ... TMSB15B ZNF883
## rowData names(10): hgnc_symbol ensembl_gene_id ...
##   percentage_gene_gc_content gene_biotype
## colnames(27412): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(2): sample_name part_cell_id
## reducedDimNames(0):
## spikeNames(0):
rowData(sce)[1:5,]
## DataFrame with 5 rows and 10 columns
##            hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
##            <character>     <character>        <character>     <character>
## ABCA7            ABCA7 ENSG00000064687              ABCA7              19
## AC009134.1             ENSG00000262116         AC009134.1              16
## AAMDC            AAMDC ENSG00000087884              AAMDC              11
## AC116050.1             ENSG00000278131         AC116050.1               2
## AC144836.1             ENSG00000262213         AC144836.1              17
##            start_position end_position    strand
##                 <integer>    <integer> <integer>
## ABCA7             1040101      1065572         1
## AC009134.1       13197606     13204907        -1
## AAMDC            77821109     77918432         1
## AC116050.1       91589494     91621421         1
## AC144836.1        1241939      1254000        -1
##                                                                                     description
##                                                                                     <character>
## ABCA7                ATP binding cassette subfamily A member 7 [Source:HGNC Symbol;Acc:HGNC:37]
## AC009134.1                                                novel transcript, antisense to SHISA9
## AAMDC      adipogenesis associated Mth938 domain containing [Source:HGNC Symbol;Acc:HGNC:30205]
## AC116050.1                                                       otopetrin 1 (OTOP1) pseudogene
## AC144836.1                                                                     novel transcript
##            percentage_gene_gc_content           gene_biotype
##                             <numeric>            <character>
## ABCA7                           60.49         protein_coding
## AC009134.1                      42.37              antisense
## AAMDC                           41.82         protein_coding
## AC116050.1                      45.86 unprocessed_pseudogene
## AC144836.1                      49.61                lincRNA

3.2 Identify Low quality cells

3.2.1 barcodeRanks filtering

Please refer to this workflow in bioconductor for reference.

bcrank = barcodeRanks(counts(sce))
# Only show unique points for plotting speed.
uniq = !duplicated(bcrank$rank)

par(mar=c(5,4,2,1), bty="n")
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy", 
     xlab="Rank", ylab="Total UMI count", cex=0.5, cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2,lwd=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2,lwd=2)

legend("left", legend=c("Inflection", "Knee"), bty="n", 
       col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2,lwd=2)

bcrank$inflection
## Per.9 
##   330
bcrank$knee
## Oligo.924 
##      1615
summary(bcrank$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     324    1101    2152    3158    4371   17392
table(bcrank$total >= bcrank$knee)
## 
## FALSE  TRUE 
## 10240 17172
table(bcrank$total >= bcrank$inflection)
## 
## FALSE  TRUE 
##     4 27408
set.seed(100)
date()
## [1] "Sun Feb 24 17:57:52 2019"
e_out = emptyDrops(counts(sce))
date()
## [1] "Sun Feb 24 17:58:16 2019"
e_out
## DataFrame with 27412 rows and 5 columns
##                   Total   LogProb    PValue   Limited       FDR
##               <integer> <numeric> <numeric> <logical> <numeric>
## Ex3e               9224       NaN         1     FALSE         0
## Ex2                5460       NaN         1     FALSE         0
## In1b               5436       NaN         1     FALSE         0
## Oligo              1508       NaN         1     FALSE         1
## Ex1                5715       NaN         1     FALSE         0
## ...                 ...       ...       ...       ...       ...
## Microglia.312       410       NaN         1     FALSE         1
## Microglia.313       484       NaN         1     FALSE         1
## Microglia.314       422       NaN         1     FALSE         1
## Microglia.315       416       NaN         1     FALSE         1
## Microglia.316       706       NaN         1     FALSE         1
length(unique(e_out$FDR))
## [1] 2
table(e_out$FDR)
## 
##     0     1 
## 17172 10240
tapply(e_out$Total, e_out$FDR, summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1615    2347    3654    4525    5846   17392 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   324.0   472.0   731.0   865.7  1315.0  1614.0

From the above analysis, some cells with very small number of UMIs. Here we chose do not remove any cells.

3.2.2 Incorporate information of mitochondira/ribosomal genes in QC metrics

We will generate a set of QC features per cell, including the expression of mitochondira genes (there is none here) or ribosomal genes. We identify ribosomal genes based on annotation from https://www.genenames.org/.

ribo_fn = file.path(work_dir,"ribosome_genes.txt")
ribo    = read.table(ribo_fn,sep='\t',header=TRUE,stringsAsFactors= FALSE)
ribo[1:2,]
##   HGNC.ID Approved.Symbol          Approved.Name   Status Previous.Symbols
## 1   10298           RPL10  ribosomal protein L10 Approved                 
## 2   10299          RPL10A ribosomal protein L10a Approved            NEDD6
##                                  Synonyms Chromosome Accession.Numbers
## 1 NOV, QM, DXS648E, DXS648, FLJ23544, L10       Xq28          AB007170
## 2                            Csa-19, L10A    6p21.31            U12404
##   RefSeq.IDs Gene.Family.Tag Gene.family.description Gene.family.ID
## 1  NM_006013             RPL    L ribosomal proteins            729
## 2  NM_007104             RPL    L ribosomal proteins            729
is_mito = which(rowData(sce)$chromosome_name == "MT")
is_ribo = which(rowData(sce)$external_gene_name %in% ribo$Approved.Symbol)
length(is_mito)
## [1] 0
length(is_ribo)
## [1] 151
sce = calculateQCMetrics(sce, feature_controls=list(Ri=is_ribo))
sort(colnames(colData(sce)))
##  [1] "is_cell_control"                               
##  [2] "log10_total_counts"                            
##  [3] "log10_total_counts_endogenous"                 
##  [4] "log10_total_counts_feature_control"            
##  [5] "log10_total_counts_Ri"                         
##  [6] "log10_total_features_by_counts"                
##  [7] "log10_total_features_by_counts_endogenous"     
##  [8] "log10_total_features_by_counts_feature_control"
##  [9] "log10_total_features_by_counts_Ri"             
## [10] "part_cell_id"                                  
## [11] "pct_counts_endogenous"                         
## [12] "pct_counts_feature_control"                    
## [13] "pct_counts_in_top_100_features"                
## [14] "pct_counts_in_top_100_features_endogenous"     
## [15] "pct_counts_in_top_100_features_feature_control"
## [16] "pct_counts_in_top_100_features_Ri"             
## [17] "pct_counts_in_top_200_features"                
## [18] "pct_counts_in_top_200_features_endogenous"     
## [19] "pct_counts_in_top_200_features_feature_control"
## [20] "pct_counts_in_top_200_features_Ri"             
## [21] "pct_counts_in_top_50_features"                 
## [22] "pct_counts_in_top_50_features_endogenous"      
## [23] "pct_counts_in_top_50_features_feature_control" 
## [24] "pct_counts_in_top_50_features_Ri"              
## [25] "pct_counts_in_top_500_features"                
## [26] "pct_counts_in_top_500_features_endogenous"     
## [27] "pct_counts_in_top_500_features_feature_control"
## [28] "pct_counts_in_top_500_features_Ri"             
## [29] "pct_counts_Ri"                                 
## [30] "sample_name"                                   
## [31] "total_counts"                                  
## [32] "total_counts_endogenous"                       
## [33] "total_counts_feature_control"                  
## [34] "total_counts_Ri"                               
## [35] "total_features_by_counts"                      
## [36] "total_features_by_counts_endogenous"           
## [37] "total_features_by_counts_feature_control"      
## [38] "total_features_by_counts_Ri"
par(mfrow=c(1,3), mar=c(5, 4, 1, 1), bty="n")
hist(log10(sce$total_counts), xlab="log10(Library sizes)", main="", 
     breaks=20, col="grey80", ylab="Number of cells")
hist(sce$log10_total_features_by_counts, xlab="log10(# of expressed genes)", 
     main="", breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_Ri, xlab="Ribosome prop. (%)",
     ylab="Number of cells", breaks=40, main="", col="grey80")

# hist(sce$pct_counts_Mt, xlab="Mitochondrial prop. (%)", 
#      ylab="Number of cells", breaks=80, main="", col="grey80")

par(mfrow=c(1,3), mar=c(5, 4, 1, 1), bty="n")
smoothScatter(log10(sce$total_counts), sce$log10_total_features_by_counts, 
              xlab="log10(Library sizes)", ylab="log10(# of expressed genes)")
smoothScatter(log10(sce$total_counts), sce$pct_counts_Ri,
              xlab="log10(Library sizes)", ylab="Ribosome prop. (%)")
# smoothScatter(log10(sce$total_counts), sce$pct_counts_Mt,
#               xlab="log10(Library sizes)", ylab="Mitochondrial prop. (%)")
# smoothScatter(sce$pct_counts_Ri,sce$pct_counts_Mt,
#               xlab="Ribosome prop. (%)", ylab="Mitochondrial prop. (%)")

From the QC results, we will filter on the metrics by identifying outliers using isOutlier.

libsize_drop = isOutlier(sce$total_counts,nmads = 3,type = "lower",log = TRUE)
feature_drop = isOutlier(sce$total_features_by_counts,nmads = 3,
                         type = "lower",log = TRUE)
mito_drop = rep(FALSE, nrow(sce))
ribo_drop = isOutlier(sce$pct_counts_Ri,nmads = 3,type = "higher")
keep = !(libsize_drop | feature_drop | mito_drop | ribo_drop)
## Warning in libsize_drop | feature_drop | mito_drop: longer object length is
## not a multiple of shorter object length
data.frame(ByLibSize = sum(libsize_drop),ByFeature = sum(feature_drop),
         ByMito = sum(mito_drop),ByRibo = sum(ribo_drop),
         Remaining = sum(keep))
##   ByLibSize ByFeature ByMito ByRibo Remaining
## 1         0         0      0    998     26414
table(colData(sce)$part_cell_id,keep)
##            keep
##             FALSE TRUE
##   Astro        14 2801
##   Endo        410  139
##   Ex1          24 5035
##   Ex2           1 1427
##   Ex3e         73  587
##   Ex4           7 1564
##   Ex5b         21 2644
##   Ex6a          1  318
##   Ex6b          8 1418
##   Ex8          32  288
##   Ex9         137  124
##   In1a          5  180
##   In1b         26  322
##   In1c         55  612
##   In3          34  595
##   In4a          4  245
##   In4b          3  611
##   In6a          0  272
##   In6b          7 1425
##   In7          23  282
##   In8          38 1306
##   Microglia    17  300
##   Oligo        48 2609
##   OPC           3 1240
##   Per           6   39
sce = sce[,keep]
dim(sce)
## [1] 16956 26414

3.3 Summarize gene level information

rowData(sce)[1:2,]
## DataFrame with 2 rows and 18 columns
##            hgnc_symbol ensembl_gene_id external_gene_name chromosome_name
##            <character>     <character>        <character>     <character>
## ABCA7            ABCA7 ENSG00000064687              ABCA7              19
## AC009134.1             ENSG00000262116         AC009134.1              16
##            start_position end_position    strand
##                 <integer>    <integer> <integer>
## ABCA7             1040101      1065572         1
## AC009134.1       13197606     13204907        -1
##                                                                           description
##                                                                           <character>
## ABCA7      ATP binding cassette subfamily A member 7 [Source:HGNC Symbol;Acc:HGNC:37]
## AC009134.1                                      novel transcript, antisense to SHISA9
##            percentage_gene_gc_content   gene_biotype is_feature_control
##                             <numeric>    <character>          <logical>
## ABCA7                           60.49 protein_coding              FALSE
## AC009134.1                      42.37      antisense              FALSE
##            is_feature_control_Ri         mean_counts   log10_mean_counts
##                        <logical>           <numeric>           <numeric>
## ABCA7                      FALSE  0.0698599153655333  0.0293269160342217
## AC009134.1                 FALSE 0.00211586166642346 0.00091793627520486
##            n_cells_by_counts pct_dropout_by_counts total_counts
##                    <integer>             <numeric>    <integer>
## ABCA7                   1752      93.6086385524588         1915
## AC009134.1                58      99.7884138333577           58
##            log10_total_counts
##                     <numeric>
## ABCA7        3.28239550474253
## AC009134.1   1.77085201164214
summary(rowData(sce)$mean_counts)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.001715  0.017401  0.068984  0.186243  0.181089 20.989494
summary(rowData(sce)$mean_counts[rowData(sce)$mean_counts>0])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.001715  0.017401  0.068984  0.186243  0.181089 20.989494
summary(rowData(sce)$n_cells_by_counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      47     453    1702    2919    4048   25645
par(mfrow=c(1,3), mar=c(5,4,1,1))
hist(log10(rowData(sce)$mean_counts+1e-6), col="grey80",  main="", 
     breaks=40, xlab="log10(ave # of UMI + 1e-6)")
hist(log10(rowData(sce)$n_cells_by_counts+1), col="grey80", main="", 
     breaks=40, xlab="log10(# of expressed cells + 1)")
plot(log10(rowData(sce)$mean_counts+1e-6), pch=16, col=rgb(0,0,0,0.4), 
     log10(rowData(sce)$n_cells_by_counts + 1), 
     xlab="log10(ave # of UMI + 1e-6)", 
     ylab="log10(# of expressed cells + 1)")

tb1 = table(rowData(sce)$n_cells_by_counts)
tb1[1:11]
## 
## 47 48 49 50 51 52 53 54 55 56 57 
##  3  3  9 11  8 17 25 12 18 23 16

We remove those genes that are lowly expressed. To filter genes, we follow the threshold to remove genes with two or more UMIs in less than 10 nuclei.

Note that the variable strand need to be renamed, otherwise there is an error message that such a variabel name cannot be used.

names(rowData(sce))[names(rowData(sce)) == "strand"] = "strand_n"

n_genes = colSums(counts(sce) >= 2)
summary(n_genes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0   144.0   332.0   536.2   769.0  3170.0
table(n_genes >= 100)
## 
## FALSE  TRUE 
##  5312 21102
table(n_genes >= 200)
## 
## FALSE  TRUE 
##  8575 17839
n_genes = colSums(counts(sce) >= 1)
summary(n_genes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   294.0   789.2  1478.5  1810.8  2612.0  6116.0
table(n_genes >= 100)
## 
##  TRUE 
## 26414
table(n_genes >= 200)
## 
##  TRUE 
## 26414
n_cells = rowSums(counts(sce) >= 2)
summary(n_cells)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0   141.0   835.3   695.0 22946.0
table(n_cells >= 10)
## 
## FALSE  TRUE 
##  3567 13389
sce = sce[which(n_cells >= 10),]
dim(sce)
## [1] 13389 26414

Next we check those highly expressed genes.

par(mar=c(5,4,1,1))
od1 = order(rowData(sce)$mean_counts, decreasing = TRUE)
barplot(rowData(sce)$mean_counts[od1[20:1]], las=1, 
        names.arg=rowData(sce)$hgnc_symbol[od1[20:1]], 
        horiz=TRUE, cex.names=0.8, cex.axis=0.8, 
        xlab="ave # of UMI")

3.4 Normalization

A simple solution for normalization and stablizing expression varaince across genes is to tranform the count data by log(count/size.factor + 1). One may calcualte size.factor per cell as the total number of UMIs, and this assumes the total expression are the same across all the cells. However, the total expression of each cell may vary with respect to cell type and/or cell size, and the computeSumFactors function in R package scran provides a more sophisicated way to calculate size.factor to allow such variaation across cells (Lun, Bach, and Marioni 2016). computeSumFactors can use initial clustering of cells to normalize expression within and beetween clusters. Within a cluster, it estimates the size factor for many groups of cells so that there are more groups than cells, and then it can calcualte the size factor per cell using a lienar deconvolution system. We remove all the cells with negative or very small size factors (< 0.01).

As shown in the following plot, the final size factor estimation is indeed highly correlated with the naive definition by total count.

Finally, the command normalize(sce) adds the normalized expression into the variable sce, which can be accessed by logcounts(sce) = log2(gene_cell_count / size_factor + 1).

date()
## [1] "Sun Feb 24 17:58:33 2019"
clusters = quickCluster(sce, min.mean=0.1, method="igraph")
table(clusters)
## clusters
##    1    2    3    4    5    6    7    8    9 
##  596 1875 2300 5736  752  997 4715 1746 7697
date()
## [1] "Sun Feb 24 17:59:31 2019"
sce = computeSumFactors(sce, cluster=clusters, min.mean=0.1)
date()
## [1] "Sun Feb 24 18:03:32 2019"
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05174 0.33275 0.68518 1.00000 1.41138 5.97528
sort(sizeFactors(sce))[1:30]
##  [1] 0.05173631 0.05645421 0.06262374 0.07171504 0.07367298 0.07448167
##  [7] 0.07763261 0.07768411 0.07838728 0.07939418 0.07966941 0.08160297
## [13] 0.08434265 0.08670765 0.08698396 0.08829519 0.08846775 0.08859333
## [19] 0.08864534 0.08887766 0.08928673 0.08934338 0.08937853 0.08988685
## [25] 0.08990635 0.08996416 0.09006206 0.09011824 0.09024464 0.09024836
dim(sce)
## [1] 13389 26414
sce = sce[,which(sizeFactors(sce) > 0.01)]
dim(sce)
## [1] 13389 26414
par(mfrow=c(1,2), mar=c(5,4,2,1), bty="n")
smoothScatter(sce$total_counts, sizeFactors(sce), log="xy", 
              xlab="total counts", ylab="size factors")
plot(sce$total_counts, sizeFactors(sce), log="xy", 
     xlab="total counts", ylab="size factors", 
     cex=0.3, pch=20, col=rgb(0.1,0.2,0.7,0.3))
abline(h=0.05)

dim(sce)
## [1] 13389 26414
sce = sce[,which(sizeFactors(sce) > 0.05)]
dim(sce)
## [1] 13389 26414
sce = normalize(sce) 

4 Dimension Reduction

For dimension reduction, such as calculating PCA or performing TSNE, we should start by identifying a subset of genes with high level of biological signal relative to background (technical) noise. These genes are referred to as HVGs (highly variable genes). The decomposeVar function from R/cran is designed for this task.

new_trend = makeTechTrend(x=sce)
fit = trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))

par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(fit$mean, fit$var, pch=20, col=rgb(0.1,0.2,0.7,0.6), 
     xlab="log(mean)", ylab="var")
curve(fit$trend(x), col="orange", lwd=2, add=TRUE)
curve(new_trend(x), col="red", lwd=2, add=TRUE)
legend("top", legend=c("Poisson noise", "observed trend"), 
       lty=1, lwd=2, col=c("red", "orange"), bty="n")

fit$trend = new_trend

dec     = decomposeVar(fit=fit) 
top_dec = dec[order(dec$bio, decreasing=TRUE),]
plotExpression(sce, features=rownames(top_dec)[1:10])

When performing PCA, we can use all the genes or just those genes with high signal-to-noise ratio (HGVs). TSNE analysis is usually based on the top PCs rather than the original gene expression data. We first perform PCA using all the genes and the function denoisePCA can automatically select the PCs based on modeling of technical noise.

date()
## [1] "Sun Feb 24 18:04:22 2019"
sce = denoisePCA(sce, technical=new_trend, approx=TRUE)
date()
## [1] "Sun Feb 24 18:05:37 2019"
dim(reducedDim(sce, "PCA"))
## [1] 26414    24
plot(log10(attr(reducedDim(sce), "percentVar")), xlab="PC",
     ylab="log10(Prop of variance explained)", pch=20, cex=0.6, 
     col=rgb(0.8, 0.2, 0.2, 0.5))
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")

df_pcs = data.frame(reducedDim(sce, "PCA"))
df_pcs$log10_total_features_by_counts = colData(sce)$log10_total_features_by_counts
df_pcs$part_cell_id = colData(sce)$part_cell_id

gp1 = ggplot(df_pcs, aes(PC1,PC2,col=log10_total_features_by_counts)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

gp1 = ggplot(df_pcs, aes(PC1,PC2,col=part_cell_id)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

date()
## [1] "Sun Feb 24 18:05:38 2019"
sce = runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)
## Warning: 'rand.seed=' is deprecated.
## Use 'set.seed' externally instead.
date()
## [1] "Sun Feb 24 18:06:56 2019"
df_tsne = data.frame(reducedDim(sce, "TSNE"))
df_tsne$log10_total_features_by_counts = colData(sce)$log10_total_features_by_counts
df_tsne$part_cell_id = colData(sce)$part_cell_id

gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features_by_counts)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

gp1 = ggplot(df_tsne, aes(X1,X2,col=part_cell_id)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

We select around top 2,500 genes (based on FDR and biological residual thresholds) as HVGs (highly variable genes).

summary(dec$bio)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.050635 -0.003706  0.001581  0.020546  0.013081  3.103550
dec1 = dec
dec1$bio[which(dec$bio < 1e-5)] = 1e-5
dec1$FDR[which(dec$FDR < 1e-100)] = 1e-100

par(mfrow=c(1,2))
hist(log10(dec1$bio),breaks=100,main="")
hist(log10(dec1$FDR),breaks=100,main="")

summary(dec$FDR[dec$bio > 0.01])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000000 0.0001085 0.0000000 0.0437114
summary(dec$FDR[dec$bio > 0.03])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 1.266e-09 0.000e+00 2.109e-06
table(dec$FDR < 1e-10,dec$bio > 0.03)
##        
##         FALSE TRUE
##   FALSE  8327   23
##   TRUE   3132 1907
table(dec$FDR < 1e-10,dec$bio > 0.01)
##        
##         FALSE TRUE
##   FALSE  7944  406
##   TRUE   1539 3500
table(dec$FDR < 1e-10,dec$bio > 0.02)
##        
##         FALSE TRUE
##   FALSE  8256   94
##   TRUE   2558 2481
w2kp = which(dec$FDR < 1e-10 & dec$bio > 0.02)
sce_hvg = sce[w2kp,]
sce_hvg
## class: SingleCellExperiment 
## dim: 2481 26414 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2481): ADAMTS17 ABCG1 ... ZRANB2 ZNF536
## rowData names(18): hgnc_symbol ensembl_gene_id ... total_counts
##   log10_total_counts
## colnames(26414): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(38): sample_name part_cell_id ...
##   pct_counts_in_top_200_features_Ri
##   pct_counts_in_top_500_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
edat = t(as.matrix(logcounts(sce_hvg)))
edat = scale(edat)
dim(edat)
## [1] 26414  2481
edat[1:2,1:3]
##        ADAMTS17      ABCG1     ADARB2
## Ex3e -0.3253376 -0.1980436 -0.4633208
## Ex2  -0.3253376 -0.1980436 -0.4633208

Perform PCA and use the top 50 PCs for TSNE projection. When calculating PCs, we use log-transformed normalized gene expression data: log2(norm_express+1).

library(svd)
library(Rtsne)

date()
## [1] "Sun Feb 24 18:07:05 2019"
ppk = propack.svd(edat,neig=50)
date()
## [1] "Sun Feb 24 18:07:16 2019"
pca = t(ppk$d*t(ppk$u)) # calculates pc scores aka principal components

tmp_df = data.frame(pca[,1:2])
names(tmp_df) = paste0("HVG_PC",seq(ncol(tmp_df)))

df_hvg = data.frame(colData(sce)[,c("sample_name","part_cell_id",
                                  "log10_total_features_by_counts")],tmp_df)
rownames(df_hvg) = NULL

set.seed(100)
date()
## [1] "Sun Feb 24 18:07:16 2019"
tsne = Rtsne(pca, pca = FALSE)
date()
## [1] "Sun Feb 24 18:08:47 2019"
tmp_df = data.frame(tsne$Y)
names(tmp_df) = paste0("HVG_TSNE",seq(ncol(tmp_df)))
df_hvg = data.frame(df_hvg,tmp_df)

gp1 = ggplot(df_hvg, aes(HVG_TSNE1,HVG_TSNE2,col=log10_total_features_by_counts)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

gp1 = ggplot(df_hvg, aes(HVG_TSNE1,HVG_TSNE2,col=part_cell_id)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

reducedDims(sce_hvg) = SimpleList(PCA=pca, TSNE=tsne$Y)
sce_hvg
## class: SingleCellExperiment 
## dim: 2481 26414 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(2481): ADAMTS17 ABCG1 ... ZRANB2 ZNF536
## rowData names(18): hgnc_symbol ensembl_gene_id ... total_counts
##   log10_total_counts
## colnames(26414): Ex3e Ex2 ... Microglia.315 Microglia.316
## colData names(38): sample_name part_cell_id ...
##   pct_counts_in_top_200_features_Ri
##   pct_counts_in_top_500_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):

5 Clustering

5.1 Kmeans

Next we cluster the cells using a simple kmeans method on the top 50 PCs. We set the number of clusters to be 25, the number of clusters that they have found in (Wang et al. 2018).

date()
## [1] "Sun Feb 24 18:08:49 2019"
k25_50_pcs = kmeans(reducedDim(sce_hvg, "PCA"), centers=25,
                    iter.max = 1e8, nstart = 250,
                    algorithm="MacQueen")
date()
## [1] "Sun Feb 24 18:11:39 2019"
names(k25_50_pcs)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
dim(k25_50_pcs$centers)
## [1] 25 50
df_tsne$cluster_kmean = as.factor(k25_50_pcs$cluster)

gp1 = ggplot(df_tsne, aes(X1,X2,col=cluster_kmean)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  guides(color = guide_legend(override.aes = list(size=3)))

gp1

Finally we save the sce object and the clustering results.

saveRDS(sce,file.path(psychENCODE_dir,"sce.rds"))
saveRDS(sce_hvg,file.path(psychENCODE_dir,"sce_hvg.rds"))
saveRDS(k25_50_pcs,file.path(psychENCODE_dir,"k25_50_pcs.rds"))

6 Session information

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rtsne_0.15                  svd_0.4.1                  
##  [3] limma_3.38.3                scran_1.10.2               
##  [5] scater_1.10.1               ggplot2_3.1.0              
##  [7] biomaRt_2.38.0              DropletUtils_1.2.2         
##  [9] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
## [11] DelayedArray_0.8.0          BiocParallel_1.16.6        
## [13] matrixStats_0.54.0          Biobase_2.42.0             
## [15] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [17] IRanges_2.16.0              S4Vectors_0.20.1           
## [19] BiocGenerics_0.28.0         data.table_1.12.0          
## [21] BiocInstaller_1.30.0        rmarkdown_1.11             
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             bit64_0.9-7             
##  [3] progress_1.2.0           httr_1.4.0              
##  [5] dynamicTreeCut_1.63-1    tools_3.5.2             
##  [7] R6_2.3.0                 irlba_2.3.3             
##  [9] KernSmooth_2.23-15       HDF5Array_1.10.1        
## [11] vipor_0.4.5              DBI_1.0.0               
## [13] lazyeval_0.2.1           colorspace_1.4-0        
## [15] withr_2.1.2              tidyselect_0.2.5        
## [17] gridExtra_2.3            prettyunits_1.0.2       
## [19] bit_1.1-14               compiler_3.5.2          
## [21] BiocNeighbors_1.0.0      labeling_0.3            
## [23] scales_1.0.0             stringr_1.4.0           
## [25] digest_0.6.18            XVector_0.22.0          
## [27] pkgconfig_2.0.2          htmltools_0.3.6         
## [29] rlang_0.3.1              RSQLite_2.1.1           
## [31] DelayedMatrixStats_1.4.0 bindr_0.1.1             
## [33] dplyr_0.7.8              RCurl_1.95-4.11         
## [35] magrittr_1.5             GenomeInfoDbData_1.2.0  
## [37] Matrix_1.2-15            Rcpp_1.0.0              
## [39] ggbeeswarm_0.6.0         munsell_0.5.0           
## [41] Rhdf5lib_1.4.2           viridis_0.5.1           
## [43] stringi_1.2.4            yaml_2.2.0              
## [45] edgeR_3.24.3             zlibbioc_1.28.0         
## [47] rhdf5_2.26.2             plyr_1.8.4              
## [49] grid_3.5.2               blob_1.1.1              
## [51] crayon_1.3.4             lattice_0.20-38         
## [53] cowplot_0.9.4            hms_0.4.2               
## [55] locfit_1.5-9.1           knitr_1.21              
## [57] pillar_1.3.1             igraph_1.2.3            
## [59] reshape2_1.4.3           XML_3.98-1.17           
## [61] glue_1.3.0               evaluate_0.13           
## [63] gtable_0.2.0             purrr_0.3.0             
## [65] assertthat_0.2.0         xfun_0.4                
## [67] viridisLite_0.3.0        tibble_2.0.1            
## [69] AnnotationDbi_1.44.0     beeswarm_0.2.3          
## [71] memoise_1.1.0            bindrcpp_0.2.2          
## [73] statmod_1.4.30

Reference

Lake, Blue B., Song Chen, Brandon C. Sos, Jean Fan, Gwendolyn E. Kaeser, Yun C. Yung, Thu E. Duong, et al. 2018. “Integrative Single-Cell Analysis of Transcriptional and Epigenetic States in the Human Adult Brain.” Nature Biotechnology 36 (1): 70–80. doi:10.1038/nbt.4038.

Lun, Aaron T. L., Karsten Bach, and John C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biology 17 (1): 75. doi:10.1186/s13059-016-0947-7.

Wang, Daifeng, Shuang Liu, Jonathan Warrell, Hyejung Won, Xu Shi, Fabio C. P. Navarro, Declan Clarke, et al. 2018. “Comprehensive Functional Genomic Resource and Integrative Model for the Human Brain.” Science 362 (6420): eaat8464. doi:10.1126/science.aat8464.